Add StructArray and RunArray benchmark tests to with_hashes#20182

Open
notashes wants to merge 5 commits into apache:main from notashes:with_hashes

@notashes notashes commented Feb 6, 2026

Which issue does this PR close?

Rationale for this change

Issue #20152 identifies some potential optimizations for RunArray and StructArray hashing, but the existing with_hashes benchmarks don't cover either array type!

What changes are included in this PR?

Added benchmarks to with_hashes.rs:

  • StructArray: 4-column struct (bool, int32, int64, string)
  • RunArray: Int32 run-encoded array
  • Both include single/multiple columns and with/without nulls
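
For context on what the RunArray benchmark exercises, here is a minimal std-only sketch of run-end encoding. `RunEncoded`, `encode`, `len`, and `get` are hypothetical names for illustration and are not the arrow-rs `RunArray` API. The sketch only assumes the layout described by the Arrow format: cumulative run ends plus one value per run, with nulls stored in the values child.

```rust
/// Hypothetical, std-only model of a run-end encoded Int32 column.
/// This is NOT the arrow-rs `RunArray` type, just an illustration of the layout.
#[derive(Debug, PartialEq)]
struct RunEncoded {
    /// Exclusive logical end index of each run (strictly increasing).
    run_ends: Vec<i32>,
    /// One physical value per run; `None` marks a run of nulls.
    values: Vec<Option<i32>>,
}

/// Collapse consecutive equal logical values (including nulls) into runs.
fn encode(logical: &[Option<i32>]) -> RunEncoded {
    let mut run_ends: Vec<i32> = Vec::new();
    let mut values: Vec<Option<i32>> = Vec::new();
    for (i, v) in logical.iter().enumerate() {
        if values.last() == Some(v) {
            // Same value as the current run: push its end one slot further.
            *run_ends.last_mut().unwrap() = (i + 1) as i32;
        } else {
            // Value changed (or first element): start a new run.
            values.push(*v);
            run_ends.push((i + 1) as i32);
        }
    }
    RunEncoded { run_ends, values }
}

impl RunEncoded {
    /// Logical length is the end of the last run.
    fn len(&self) -> usize {
        self.run_ends.last().map_or(0, |&end| end as usize)
    }

    /// Logical lookup: binary-search for the first run whose end exceeds `i`.
    /// Panics if `i >= self.len()`.
    fn get(&self, i: usize) -> Option<i32> {
        let run = self.run_ends.partition_point(|&end| (end as usize) <= i);
        self.values[run]
    }
}
```

The physical `values` vector can be much shorter than the logical length, which is why hashing run-encoded input is a distinct case from hashing a flat array.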

Are these changes tested?

No additional tests were added, but both benchmarks compile and run.

A sample run:
❯ cargo bench --features=parquet --bench with_hashes -- array
   Compiling datafusion-common v52.1.0 (/Users/notashes/dev/datafusion/datafusion/common)
    Finished `bench` profile [optimized] target(s) in 34.49s
     Running benches/with_hashes.rs (target/release/deps/with_hashes-2f180744d22084f3)
Gnuplot not found, using plotters backend
struct_array: single, no nulls
                        time:   [38.389 µs 38.437 µs 38.485 µs]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild

struct_array: single, nulls
                        time:   [46.108 µs 46.197 µs 46.291 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

struct_array: multiple, no nulls
                        time:   [114.64 µs 114.79 µs 114.93 µs]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild

struct_array: multiple, nulls
                        time:   [138.29 µs 138.62 µs 139.07 µs]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

run_array_int32: single, no nulls
                        time:   [1.8777 µs 1.9098 µs 1.9457 µs]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

run_array_int32: single, nulls
                        time:   [2.0110 µs 2.0417 µs 2.0751 µs]
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

run_array_int32: multiple, no nulls
                        time:   [5.0511 µs 5.0603 µs 5.0693 µs]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild

run_array_int32: multiple, nulls
                        time:   [5.6052 µs 5.6201 µs 5.6353 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Are there any user-facing changes?

@github-actions github-actions bot added the common Related to common crate label Feb 6, 2026
do_hash_test(b, &arrays);
});

// Union arrays can't have null bitmasks
Contributor

Mentioning union array when we don't implement that here?

Author

@notashes notashes Feb 8, 2026

I copied that verbatim from the other PR 😅 (to avoid merge conflicts in the future?), but I'm getting the sense that it's the wrong approach here!

.clone()
.into_data()
.into_builder()
.nulls(Some(create_null_mask(values.len())))
Contributor

Something to think about is how null density behaves differently here for run arrays, since we'd apply nulls to entire runs 🤔

Author

I've been thinking about this for a while. The element-level null density should still come out around the same 3%, though the variance could be a bit higher.

I've set run_length to be within 1..50.
Say we have ~300 runs on average, each carrying ~25 elements. Nulling 3% of the runs (~10 runs) then translates to roughly 10 * 25 = 250 null elements out of ~7,500. But yes, that is probably our ideal scenario.

Let me know what you think. I'll try to do some testing on this.
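
A rough way to sanity-check that estimate is a small simulation. The sketch below is std-only and hypothetical: `SplitMix64`, `simulate_run_nulls`, and `mean_null_fraction` are illustrative names, not code from this PR. It assumes each run is nulled independently with probability 3% and that run lengths are drawn uniformly from 1..50 (upper bound exclusive). Under those assumptions the expected element-level null fraction equals the run-level probability, but a single array can land well away from it because each null decision covers a whole run.

```rust
/// Small deterministic generator (SplitMix64) so the sketch needs no external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }

    /// Integer in [lo, hi); the tiny modulo bias is ignored for this sketch.
    fn range(&mut self, lo: u64, hi: u64) -> u64 {
        lo + self.next_u64() % (hi - lo)
    }

    /// Float in [0, 1).
    fn unit(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Fill `total_len` logical slots with runs of length 1..50 and null each
/// run with probability `run_null_prob`. Returns (run count, null elements).
fn simulate_run_nulls(total_len: usize, run_null_prob: f64, seed: u64) -> (usize, usize) {
    let mut rng = SplitMix64(seed);
    let (mut runs, mut nulls, mut filled) = (0usize, 0usize, 0usize);
    while filled < total_len {
        // Truncate the last run so the array has exactly `total_len` elements.
        let run_len = (rng.range(1, 50) as usize).min(total_len - filled);
        if rng.unit() < run_null_prob {
            nulls += run_len; // the whole run becomes null
        }
        filled += run_len;
        runs += 1;
    }
    (runs, nulls)
}

/// Average element-level null fraction over `trials` differently seeded arrays.
fn mean_null_fraction(total_len: usize, run_null_prob: f64, trials: u64) -> f64 {
    let mut sum = 0.0;
    for seed in 1..=trials {
        let (_, nulls) = simulate_run_nulls(total_len, run_null_prob, seed);
        sum += nulls as f64 / total_len as f64;
    }
    sum / trials as f64
}
```

Calling `simulate_run_nulls(7500, 0.03, 42)` gives the run count and null-element count for one generated array, while `mean_null_fraction(7500, 0.03, 2000)` averages the density over many seeds. Comparing the two shows how far a single benchmark input can drift from the nominal 3%.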

)
}

fn string_array(array_len: usize) -> ArrayRef {
Contributor

Do we need this if we already have StringPool above?

Author

Done! I don't think a separate one offers any benefit. Both give me close to a 10% speedup locally (with the struct_array optimization).

Development

Successfully merging this pull request may close these issues.

Add StructArray and RunArray benchmarks to with_hashes suite in datafusion-common

2 participants